CS计算机代考程序代写 AI Hidden Markov Mode algorithm Formal Language Theory &
 Finite State Automata

Formal Language Theory &
 Finite State Automata
COMP90042
Natural Language Processing
Lecture 13
Semester 1 2021 Week 7 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
1

COMP90042
L13

Methods to process sequence of words: ‣ N-gram language Model
‣ Hidden Markov Model
‣ Recurrent Neural Networks

Nothing is fundamentally linguistic about these models
What Have We Learnt?
2

COMP90042
L13

Studies classes of languages and their computational properties
‣ Regular language (this lecture)
‣ Context free language (next lecture)
Formal Language Theory
3

COMP90042
L13
• •
Formal Language Theory A language = set of strings
A string = sequence of elements from a finite alphabet (AKA vocabulary)
today is a nice day
ai will take over the world one day
interstellar is a great movie trump will never be a vegan

4

COMP90042
L13


Main goal is to solve the membership problem ‣ Whether a string is in a language
How? By defining its grammar
Why?
colourless green ideas
 sleep furiously
?
today is a nice day
ai will take over the world one day
interstellar is a great movie trump will never be a vegan

5

COMP90042
L13
Examples of Language
• Binarystringsthatstartwith0andendwith1 ‣ {01,001,011,0001,…}✓
‣ (1,0,00,11,100,…}✗
• Even-lengthsequencesfromalphabet{a,b} ‣ {aa,ab,ba,bb,aaaa,…}✓
‣ {aaa,aba,bbb,…}✗
• Englishsentencesthatstartwithwh-wordandendin? ‣ {what?,wheremypants?,…}✓
‣ { hello how are you?, why is the dog so cute! }
6

COMP90042
L13
Beyond Membership Problem… Membership
‣ Is the string part of the language? Y/N



Scoring
‣ Graded membership
‣ “How acceptable is a string?” (language models!)
Transduction
‣ “Translate” one string into another (stemming!)
7

COMP90042
L13
• • •
Regular language Finite state acceptor Finite state transducer
Outline
8

COMP90042
L13
Regular Language
9

COMP90042
L13
Regular Language
• Thesimplestclassoflanguages
• Anyregularexpressionisaregularlanguage

Describes what strings are part of the language (e.g. ‘0(0|1)*1’)
10

COMP90042
L13

Formally, a regular expression includes the following operations/definitions:
‣ Symbol drawn from alphabet, Σ
‣ Empty string, ε
‣ Concatenation of two regular expressions, RS ‣ Alternation of two regular expressions, R|S
‣ Kleene star for 0 or more repeats, R*
‣ Parenthesis () to define scope of operations
Regular Languages
11

COMP90042
L13



Examples of Regular Languages Binary strings that start with 0 and end with 1
‣ 0(0|1)*1
Even-length sequences from alphabet {a, b}
‣ ((aa)|(ab)|(ba)|(bb))*
English sentences that start with wh-word and end
in ?
‣ ((what)|(where)|(why)|(which)|(whose)|(whom)) Σ* ?
12

COMP90042
L13


Closure: if we take regular languages L1 and L2 and merge them, is the resulting language regular?

Extremely versatile! Can have RLs for different properties of language, and use them together
Properties of Regular Languages
RLs are closed under the following:
‣ concatenation and union
‣ intersection: strings that are valid in both L1 and L2 ‣ negation: strings that are not in L
13

COMP90042
L13
Finite State Acceptor
14

COMP90042
L13
Finite State Acceptor
Regular expression defines a regular language
But it doesn’t give an algorithm to check whether a string belongs to the language

• •
Finite state acceptors (FSA) describes the computation involved for membership checking
15

COMP90042
L13

FSA consists:
‣ alphabet of input symbols, Σ
‣ set of states, Q

Accepts strings if there is path from q0 to a final state with transitions matching each symbol
‣ Djisktra’s shortest-path algorithm, O(V log V + E)
Finite State Acceptors

‣ transition function: symbol and state → next state
start state, q0 ∈ Q ‣ final states, F ⊆ Q
16

COMP90042
L13



Input alphabet States
Start, final states
{a, b} {q0, q1} q0, {q1}
Example FSA
abb ✓ aba ✗ aab ✓ b✓ aaa ✗
REGUTrLaAnsRitiLoAnNfuGncUtiAonGES{(q0,a) → q0, (q0, b) → q1,
 (q1,b) → q1}


Regular expression defined
 by this FSA?

start
PollEv.com/jeyhanlau569
a
b
q0 b q1
17
Figure 9.1: State diagram for the finite state acce

COMP90042
L13
18

COMP90042
L13

• • • • •
Use of affixes to change word to another grammatical category
grace → graceful → gracefully grace → disgrace → disgracefully allure → alluring → alluringly allure → *allureful
allure → *disallure
Derivational Morphology
19

COMP90042
L13

Fairly consistent process
‣ want to accept valid forms (grace → graceful) ‣ reject invalid ones (allure → *allureful)
‣ generalise to other words, e.g., nouns that behave like grace or allure
FSA for Morphology
20

COMP90042
L13
FSA for Word Morphology
play spare
light
21

COMP90042
L13

Some words are more plausible than others
‣ fishful vs. disgracelyful
‣ musicky vs. writey

Graded measure of acceptability — weighted FSA changes the following:
‣ start state weight function, λ: Q → R ‣ final state weight function, ρ: Q → R ‣ transition function, δ: (Q, Σ, Q) → R
Weighted FSA
22

COMP90042
L13

WFSA Shortest-Path Total score of a path π = t1,…,tN
λ(t0) + ∑N δ(ti) + ρ(tN) i=1
t is an edge
Use shortest-path algorithm to find π with
minimum cost
‣ O(V log V + E), as before


23

COMP90042
L13
Finite State Transducer
24

COMP90042
L13

Often don’t want to just accept or score strings ‣ want to translate them into another string
Finite State Transducers (FST)
25

COMP90042
L13
FSA for Word Morphology
Finite state acceptor: allure + ing = allureing Finite state transducer: allure + ing = alluring
26

COMP90042
L13

FST add string output capability to FSA
‣ includes an output alphabet
‣ transitions now take input symbol and emit output symbol (Q, Σ, Σ, Q)


Can be weighted = WFST
‣ Graded scores for transition
E.g., edit distance as WFST
‣ distance to transform one string to another
Finite State Transducer
27

rtions, deletions, and substitutions. This can be computed by e
r
e
n g
a label x/y : c indicates a cost of c for a transition with input x and outp
N
COMP90042
state transducer, in which the input and output alphabets are L13
onsider the alphabet ⌃ = ⌦ = {a, b}. The edit distance can be Edit Distance Automata
ansducer with the following transitions,
(q,a,a,q) = (q,b,b,q) = 0 202
(q,a,b,q) = (q,b,a,q) = 1
(q,a,✏,q) = (q,b,✏,q) = 1
(q,✏,a,q) = (q,✏,b,q) = 1. input symbol
[9.12]
CHAPTER9. FORMALLA
[9.13]
[9.14]
[9.15]
input = output; no cost
a/a,b/b : 0
delete a character
a/✏, b/✏ : 1 r, there are multiple paths through the transducer: the best-
n in Figure 9.5.
output symbol
start
q
to desert involves a single deletion, for a total score of 1; the
s seven deletions and six additions, for a score of 13.✏/a, ✏/b : 1
ab → bb: 1
ab → aaab: 2 insert a new character
a/b, b/a : 1 a→b or b→a
g algorithm is a “lexicon-free” algorithm for stripping suffixes
Figure 9.5: State diagram for the Levenshtein edit distance finite st
a sequence of character-level rules. Each rule can be described
28

COMP90042
L13
FST for Inflectional Morphology
• VerbinflectionsinSpanishmustmatchthesubjectin person & number
• Goalofmorphologicalanalysis:
canto → cantar+VERB+present+1P+singular

cantar
to sing
1P singular
yo canto
I sing
2P singular
tu cantas
you sing
3P singular
ella canta
she sings
1P plural
nostotros cantamos
we sing
2P plural
vosotros cantáis
you sing
3P plural
ellas cantan
they sing
29

COMP90042
L13
204
start
CHAPTER9. FORMALLANGUAGETHEORY
FST for Spanish Inflection
c/c a/a n/n t/t
✏/a
✏/+Sing
✏/+Sing
o/o
✏/+Noun ✏/+Masc
✏/r ✏/+Verb
✏/+Sing
o/+PresInd ✏/+1p
a/+PresInd ✏/+3p
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two
canto →
accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun,
meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept- ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).
• canto+Noun+Masc+Sing
—fa•ilicngatnotaccre+pVtsetrirnbg+soPrrtreanssIdnudcti+on1sPth+atSarienvgalid.Forexample,apluralization
transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans-
ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer canta → cantar+VERB+PresInd+3P+Sing
would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then
it simultaneously overgenerates and undergenerates.
30

COMP90042
L13
Is natural language regular?
• Yes
• No
• Maybe
• Don’t know
PollEv.com/jeyhanlau569
31

COMP90042
L13
32

COMP90042
L13

Example:


Sometimes…
the mouse that ran.

the cat that killed the mouse that ran.
 …
the lion that bullied the hyena that bit the dog that chased the cat that killed the mouse that ran
Length is unbounded, but structure is local ‣ (Det Noun Prep Verb)*
‣ Can describe with FSA
33

COMP90042
L13

Arithmetic expressions with balanced parentheses ‣ (a+(bx(c/d)))
‣ Can have arbitrarily many opening parentheses
‣ Need to remember how many open parentheses, to produce the same number of closed parentheses
‣ Can’t be done with finite number of states anbn

Non-Regular Languages
34

COMP90042
L13
Center Embedding
• Center embedding of relative clauses
‣ The cat loves Mozart
‣ The cat the dog chased loves Mozart
‣ The cat the dog the rat bit chased loves Mozart
‣ The cat the dog the rat the elephant admired bit chased loves Mozart
• Need to remember the n subject nouns, to ensure n verbs follow (and that they agree etc)
• Requires (at least) context-free grammar (next lecture!)
35

COMP90042
L13
• • • • •
Concept of a language
Regular languages
Finite state automata: acceptors, transducers Weighted variants
Application to edit distance, morphology
Summary
36

COMP90042
L13

Reading E18, Chapter 9.1
37